AAAI.2024 - Natural Language Processing

Total: 240

#1 Frame Semantic Role Labeling Using Arbitrary-Order Conditional Random Fields

Authors: Chaoyi Ai ; Kewei Tu

This paper presents an approach to frame semantic role labeling (FSRL), a task in natural language processing that identifies semantic roles within a text following the theory of frame semantics. Unlike previous approaches which do not adequately model correlations and interactions amongst arguments, we propose arbitrary-order conditional random fields (CRFs) that are capable of modeling full interaction amongst an arbitrary number of arguments of a given predicate. To achieve tractable representation and inference, we apply canonical polyadic decomposition to the arbitrary-order factor in our proposed CRF and utilize mean-field variational inference for approximate inference. We further unfold our iterative inference procedure into a recurrent neural network that is connected to our neural encoder and scorer, enabling end-to-end training and inference. Finally, we also improve our model with several techniques such as span-based scoring and decoding. Our experiments show that our approach achieves state-of-the-art performance in FSRL.
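
As a sketch of the core trick: if the log-potential of an order-n factor over a predicate's arguments is given a rank-R canonical polyadic form, the mean-field update for each argument's marginal stays tractable. The notation below (R, u, s) is assumed for illustration and is not taken verbatim from the paper.

```latex
% Rank-R CP form of the log-potential over arguments y_1, ..., y_n:
\log \psi(y_1,\dots,y_n) \;=\; \sum_{r=1}^{R} \prod_{i=1}^{n} u^{(i)}_{r}(y_i)

% Mean-field update for the marginal Q_i, with unary score s_i; the expectation
% factorizes over the other marginals, so each iteration stays tractable in R:
Q_i(y_i) \;\propto\; \exp\!\Big( s_i(y_i)
  + \sum_{r=1}^{R} u^{(i)}_{r}(y_i) \prod_{j \neq i} \sum_{y_j} Q_j(y_j)\, u^{(j)}_{r}(y_j) \Big)
```

Unrolling a fixed number of such updates yields the recurrent network that the paper trains end-to-end with its encoder and scorer.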

#2 DTF-AT: Decoupled Time-Frequency Audio Transformer for Event Classification

Authors: Tony Alex ; Sara Ahmed ; Armin Mustafa ; Muhammad Awais ; Philip JB Jackson

Convolutional neural networks (CNNs) and Transformer-based networks have recently enjoyed significant attention for various audio classification and tagging tasks following their wide adoption in the computer vision domain. Despite the difference in information distribution between audio spectrograms and natural images, there has been limited exploration of effective information retrieval from spectrograms using domain-specific layers tailored for the audio domain. In this paper, we leverage the power of the Multi-Axis Vision Transformer (MaxViT) to create DTF-AT (Decoupled Time-Frequency Audio Transformer) that facilitates interactions across time, frequency, spatial, and channel dimensions. The proposed DTF-AT architecture is rigorously evaluated across diverse audio and speech classification tasks, consistently establishing new benchmarks for state-of-the-art (SOTA) performance. Notably, on the challenging AudioSet 2M classification task, our approach demonstrates a substantial improvement of 4.4% when the model is trained from scratch and 3.2% when the model is initialised from ImageNet-1K pretrained weights. In addition, we present comprehensive ablation studies to investigate the impact and efficacy of our proposed approach. The codebase and pretrained weights are available at https://github.com/ta012/DTFAT.git
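
To make the "interactions across time and frequency" idea concrete, the sketch below applies self-attention separately along the time axis and the frequency axis of a spectrogram feature map. The module and shapes are illustrative assumptions, not the paper's actual DTF-AT blocks (which build on MaxViT).

```python
# Minimal sketch (PyTorch) of decoupled time/frequency attention over a
# spectrogram feature map; names and shapes are assumptions for illustration.
import torch
import torch.nn as nn

class DecoupledTFAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frequency, time, dim)
        b, f, t, d = x.shape
        # Attend along the time axis: fold frequency into the batch.
        xt = x.reshape(b * f, t, d)
        xt, _ = self.time_attn(xt, xt, xt)
        x = xt.reshape(b, f, t, d)
        # Attend along the frequency axis: fold time into the batch.
        xf = x.permute(0, 2, 1, 3).reshape(b * t, f, d)
        xf, _ = self.freq_attn(xf, xf, xf)
        return xf.reshape(b, t, f, d).permute(0, 2, 1, 3)

# Example: 8 clips, 64 mel bins, 100 frames, 96-dim features.
y = DecoupledTFAttention(96)(torch.randn(8, 64, 100, 96))
```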

#3 WikiSQE: A Large-Scale Dataset for Sentence Quality Estimation in Wikipedia

Authors: Kenichiro Ando ; Satoshi Sekine ; Mamoru Komachi

Wikipedia can be edited by anyone and thus contains sentences of varying quality, including poor-quality edits that are often marked up by other editors. While editors' reviews enhance the credibility of Wikipedia, it is hard to check all edited text, so assistance with this process is very important; however, a large and comprehensive dataset for studying it does not currently exist. Here, we propose WikiSQE, the first large-scale dataset for sentence quality estimation in Wikipedia. Each sentence is extracted from the entire revision history of English Wikipedia, and the target quality labels were carefully investigated and selected. WikiSQE has about 3.4M sentences with 153 quality labels. In experiments with automatic classification using competitive machine learning models, sentences that had problems with citation, syntax/semantics, or propositions were found to be more difficult to detect. In addition, through human annotation we found that our model performed better than the crowdsourced workers. WikiSQE is expected to be a valuable resource for other tasks in NLP.

#4 Beyond Grounding: Extracting Fine-Grained Event Hierarchies across Modalities

Authors: Hammad Ayyubi ; Christopher Thomas ; Lovish Chum ; Rahul Lokesh ; Long Chen ; Yulei Niu ; Xudong Lin ; Xuande Feng ; Jaywon Koo ; Sounak Ray ; Shih-Fu Chang

Events describe happenings in our world that are of importance. Naturally, understanding events mentioned in multimedia content and how they are related forms an important way of comprehending our world. Existing literature can infer if events across textual and visual (video) domains are identical (via grounding) and thus on the same semantic level. However, grounding fails to capture the intricate cross-event relations that exist because the same events are referred to on many semantic levels. For example, the abstract event of "war" manifests at a lower semantic level through the subevents "tanks firing" (in video) and "airplane shot" (in text), leading to a hierarchical, multimodal relationship between the events. In this paper, we propose the task of extracting event hierarchies from multimodal (video and text) data to capture how the same event manifests itself in different modalities at different semantic levels. This reveals the structure of events and is critical to understanding them. To support research on this task, we introduce the Multimodal Hierarchical Events (MultiHiEve) dataset. Unlike prior video-language datasets, MultiHiEve is composed of news video-article pairs, which makes it rich in event hierarchies. We densely annotate a part of the dataset to construct the test benchmark. We show the limitations of state-of-the-art unimodal and multimodal baselines on this task. Further, we address these limitations via a new weakly supervised model, leveraging only unannotated video-article pairs from MultiHiEve. A thorough evaluation of our proposed method demonstrates improved performance on this task and highlights opportunities for future research. Data: https://github.com/hayyubi/multihieve

#5 All Should Be Equal in the Eyes of LMs: Counterfactually Aware Fair Text Generation

Authors: Pragyan Banerjee ; Abhinav Java ; Surgan Jandial ; Simra Shahid ; Shaz Furniturewala ; Balaji Krishnamurthy ; Sumit Bhatia

Fairness in Language Models (LMs) remains a long-standing challenge, given the inherent biases in training data that can be perpetuated by models and affect downstream tasks. Recent methods employ expensive retraining or attempt debiasing during inference by constraining model outputs to contrast with a reference set of biased templates or exemplars. Either way, they do not address the primary goal of fairness: maintaining equitability across different demographic groups. In this work, we posit that generating unbiased output for one demographic under a given context requires awareness of the model's outputs for other demographics under the same context. To this end, we propose Counterfactually Aware Fair InferencE (CAFIE), a framework that dynamically compares the model's understanding of diverse demographics to generate more equitable sentences. We conduct an extensive empirical evaluation using base LMs of varying sizes across three diverse datasets and find that CAFIE outperforms strong baselines, producing fairer text and striking the best balance between fairness and language modeling capability.

#6 Graph of Thoughts: Solving Elaborate Problems with Large Language Models

Authors: Maciej Besta ; Nils Blach ; Ales Kubicek ; Robert Gerstenberger ; Michal Podstawski ; Lukas Gianinazzi ; Joanna Gajda ; Tomasz Lehmann ; Hubert Niewiadomski ; Piotr Nyczyk ; Torsten Hoefler

We introduce Graph of Thoughts (GoT): a framework that advances prompting capabilities in large language models (LLMs) beyond those offered by paradigms such as Chain-of-Thought or Tree of Thoughts (ToT). The key idea and primary advantage of GoT is the ability to model the information generated by an LLM as an arbitrary graph, where units of information ("LLM thoughts") are vertices, and edges correspond to dependencies between these vertices. This approach enables combining arbitrary LLM thoughts into synergistic outcomes, distilling the essence of whole networks of thoughts, or enhancing thoughts using feedback loops. We illustrate that GoT offers advantages over the state of the art on different tasks, for example increasing the quality of sorting by 62% over ToT while simultaneously reducing costs by >31%. We ensure that GoT is extensible with new thought transformations and thus can be used to spearhead new prompting schemes. This work brings LLM reasoning closer to human thinking and brain mechanisms such as recurrence, both of which form complex networks.
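
To illustrate the graph abstraction, here is a toy sketch in which thoughts are vertices, dependency edges record provenance, and an aggregation transformation merges several thoughts into a new one (e.g., merging sorted sublists). The class and function names are assumptions for illustration, not the GoT framework's actual API.

```python
# Toy Graph-of-Thoughts: thoughts are vertices, parent links are dependency
# edges, and "aggregate" is one thought transformation among several.
from dataclasses import dataclass, field

@dataclass
class Thought:
    content: str
    parents: list = field(default_factory=list)  # incoming dependency edges

def aggregate(thoughts, llm):
    """Merge several partial results (e.g., sorted sublists) into one thought."""
    prompt = "Merge these partial results:\n" + "\n".join(t.content for t in thoughts)
    return Thought(content=llm(prompt), parents=list(thoughts))

# Sort-by-splitting as a graph: solve parts independently, then aggregate.
fake_llm = lambda prompt: "1 2 3 4 5 6"      # stand-in for a real LLM call
parts = [Thought("2 4 6"), Thought("1 3 5")]  # independently sorted sublists
merged = aggregate(parts, fake_llm)           # graph edges: parts -> merged
print(merged.content)
```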

#7 When Do Program-of-Thoughts Work for Reasoning?

Authors: Zhen Bi ; Ningyu Zhang ; Yinuo Jiang ; Shumin Deng ; Guozhou Zheng ; Huajun Chen

In the realm of embodied artificial intelligence, the reasoning capabilities of Large Language Models (LLMs) play a pivotal role. Although there are effective methods, such as program-of-thought prompting, that use programming language to help LLMs tackle complex reasoning tasks, the specific impact of code data on the improvement of reasoning capabilities remains under-explored. To address this gap, we propose the Complexity-Impacted Reasoning Score (CIRS), which combines structural and logical attributes to measure the correlation between code and reasoning abilities. Specifically, we use the abstract syntax tree to encode structural information and calculate logical complexity by considering the difficulty and the cyclomatic complexity. Through an empirical analysis, we find that not all code data of every complexity level can be learned or understood by LLMs: an optimal level of complexity is critical to improving reasoning abilities through program-aided prompting. We then design an auto-synthesizing and stratifying algorithm and apply it to instruction generation for mathematical reasoning and to code data filtering for code generation tasks. Extensive results demonstrate the effectiveness of our proposed approach.
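
The AST-plus-cyclomatic idea is easy to sketch: parse the code, count nodes for structural size, and count branch points for a McCabe-style logical complexity. The weighting below is an assumed stand-in, not the paper's actual CIRS formula.

```python
# Rough complexity measure over an abstract syntax tree, in the spirit of
# CIRS: structural size plus a crude cyclomatic-style branch count.
import ast

BRANCH_NODES = (ast.If, ast.For, ast.While, ast.Try, ast.BoolOp)

def complexity_score(source: str) -> float:
    tree = ast.parse(source)
    nodes = list(ast.walk(tree))
    n_nodes = len(nodes)                                  # structural size
    n_branches = sum(isinstance(n, BRANCH_NODES) for n in nodes)
    cyclomatic = n_branches + 1                           # McCabe-style estimate
    return n_nodes * 0.1 + cyclomatic                     # assumed weighting

print(complexity_score("for i in range(3):\n    if i % 2:\n        print(i)"))
```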

#8 Beyond Attention: Breaking the Limits of Transformer Context Length with Recurrent Memory

Authors: Aydar Bulatov ; Yuri Kuratov ; Yermek Kapushev ; Mikhail Burtsev

A major limitation for the broader scope of problems solvable by transformers is the quadratic scaling of computational complexity with input size. In this study, we investigate the recurrent memory augmentation of pre-trained transformer models to extend input context length while linearly scaling compute. Our approach demonstrates the capability to store information in memory for sequences of up to an unprecedented two million tokens while maintaining high retrieval accuracy. Experiments with language modeling tasks show perplexity improvement as the number of processed input segments increases. These results underscore the effectiveness of our method, which has significant potential to enhance long-term dependency handling in natural language understanding and generation tasks, as well as enable large-scale context processing for memory-intensive applications.
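
A minimal sketch of the recurrence, assuming a generic encoder: the input is split into fixed-length segments, each segment is processed together with a handful of memory vectors, and the memory outputs are carried into the next segment, so compute grows linearly with length. Module names and shapes are assumptions.

```python
# Recurrent memory over segments: memory token outputs from one segment
# become the memory inputs of the next.
import torch
import torch.nn as nn

class RecurrentMemoryWrapper(nn.Module):
    def __init__(self, dim=64, n_mem=4, seg_len=128):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.mem_init = nn.Parameter(torch.randn(1, n_mem, dim))
        self.n_mem, self.seg_len = n_mem, seg_len

    def forward(self, x):                     # x: (batch, length, dim)
        mem = self.mem_init.expand(x.size(0), -1, -1)
        outs = []
        for seg in x.split(self.seg_len, dim=1):
            h = self.encoder(torch.cat([mem, seg], dim=1))
            mem, out = h[:, :self.n_mem], h[:, self.n_mem:]
            outs.append(out)                  # memory recurs across segments
        return torch.cat(outs, dim=1)

y = RecurrentMemoryWrapper()(torch.randn(2, 512, 64))  # 4 segments of 128
```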

#9 MedBench: A Large-Scale Chinese Benchmark for Evaluating Medical Large Language Models

Authors: Yan Cai ; Linlin Wang ; Ye Wang ; Gerard de Melo ; Ya Zhang ; Yanfeng Wang ; Liang He

The emergence of various medical large language models (LLMs) in the medical domain has highlighted the need for unified evaluation standards, as manual evaluation of LLMs proves to be time-consuming and labor-intensive. To address this issue, we introduce MedBench, a comprehensive benchmark for the Chinese medical domain, comprising 40,041 questions sourced from authentic examination exercises and medical reports of diverse branches of medicine. In particular, this benchmark is composed of four key components: the Chinese Medical Licensing Examination, the Resident Standardization Training Examination, the Doctor In-Charge Qualification Examination, and real-world clinic cases encompassing examinations, diagnoses, and treatments. MedBench replicates the educational progression and clinical practice experiences of doctors in Mainland China, thereby establishing itself as a credible benchmark for assessing the mastery of knowledge and reasoning abilities of medical LLMs. We perform extensive experiments and conduct an in-depth analysis from diverse perspectives, which culminate in the following findings: (1) Chinese medical LLMs underperform on this benchmark, highlighting the need for significant advances in clinical knowledge and diagnostic precision. (2) Several general-domain LLMs surprisingly possess considerable medical knowledge. These findings elucidate both the capabilities and limitations of LLMs within the context of MedBench, with the ultimate goal of aiding the medical research community.

#10 CAR-Transformer: Cross-Attention Reinforcement Transformer for Cross-Lingual Summarization

Authors: Yuang Cai ; Yuyu Yuan

Cross-Lingual Summarization (CLS) involves generating a summary for a given document in another language. Most existing approaches adopt multi-task training and knowledge distillation, which increase the training cost and improve CLS performance in ways that are intuitive but unexplained. In this work, we propose the Cross-Attention Reinforcement (CAR) module and incorporate it into a transformer backbone to form the CAR-Transformer. The CAR module formulates a pseudo summarization policy, parameterized by the cross-attention weights and reinforced by the ground-truth monolingual summary, without introducing extra model parameters. Our approach demonstrates more consistent improvement across CLS tasks than traditional multi-task training methods, and it outperforms the fine-tuned vanilla mBART by 3.67 and the best-performing multi-task training approach by 1.48 in ROUGE-L F1 score on the WikiLingua Korean-to-English CLS task.

#11 Compositional Generalization for Multi-Label Text Classification: A Data-Augmentation Approach

Authors: Yuyang Chai ; Zhuang Li ; Jiahui Liu ; Lei Chen ; Fei Li ; Donghong Ji ; Chong Teng

Despite significant advancements in multi-label text classification, the ability of existing models to generalize to novel and seldom-encountered complex concepts, which are compositions of elementary ones, remains underexplored. This research addresses this gap. By creating unique data splits across three benchmarks, we assess the compositional generalization ability of existing multi-label text classification models. Our results show that these models often fail to generalize to compositional concepts encountered infrequently during training, leading to inferior performance on tests with these new combinations. To address this, we introduce a data augmentation method that leverages two innovative text generation models designed to enhance the classification models' capacity for compositional generalization. Our experiments show that this data augmentation approach significantly improves the compositional generalization capabilities of classification models on our benchmarks, with both generation models surpassing other text generation baselines. Our code is available at https://github.com/yychai74/LD-VAE.

#12 Counterfactual-Enhanced Information Bottleneck for Aspect-Based Sentiment Analysis

Authors: Mingshan Chang ; Min Yang ; Qingshan Jiang ; Ruifeng Xu

Despite their notable success in aspect-based sentiment analysis (ABSA), deep neural networks are susceptible to spurious correlations between input features and output labels, leading to poor robustness. In this paper, we propose a novel Counterfactual-Enhanced Information Bottleneck framework (CEIB) to reduce spurious correlations in ABSA. CEIB extends the information bottleneck (IB) principle to a factual-counterfactual balancing setting by integrating augmented counterfactual data, with the goal of learning a robust ABSA model. Concretely, we first devise a multi-pattern prompting method that utilizes a large language model (LLM) to generate high-quality counterfactual samples from the original samples. We then employ the information bottleneck principle and separate the mutual information into factual and counterfactual parts. In this way, we can learn effective and robust representations for the ABSA task by balancing the predictive information of these two parts. Extensive experiments on five benchmark ABSA datasets show that our CEIB approach achieves superior prediction performance and robustness over state-of-the-art baselines. Code and data to reproduce the results in this paper are available at: https://github.com/shesshan/CEIB.
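
For orientation, the classic IB objective trades compression against prediction; a factual-counterfactual split can balance two such terms. The formulation below, including the weights beta and lambda, is an assumed sketch rather than CEIB's exact objective.

```latex
% Classic information bottleneck: compress X into Z while predicting Y.
\mathcal{L}_{\mathrm{IB}} = \beta\, I(X;Z) - I(Z;Y)

% A factual (f) / counterfactual (cf) balanced variant, with trade-off \lambda:
\mathcal{L} = \big[\beta\, I(X_f;Z_f) - I(Z_f;Y_f)\big]
            + \lambda \big[\beta\, I(X_{cf};Z_{cf}) - I(Z_{cf};Y_{cf})\big]
```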

#13 Visual Instruction Tuning with Polite Flamingo

Authors: Delong Chen ; Jianfeng Liu ; Wenliang Dai ; Baoyuan Wang

Recent research has demonstrated that multi-task fine-tuning of multi-modal Large Language Models (LLMs) on an assortment of annotated downstream vision-language datasets significantly enhances their performance. Yet, during this process, a side effect, which we term the "multi-modal alignment tax", surfaces. This side effect negatively impacts the model's ability to format responses appropriately - for instance, its "politeness" - due to the overly succinct and unformatted nature of raw annotations, resulting in reduced human preference. In this paper, we introduce Polite Flamingo, a multi-modal response rewriter that transforms raw annotations into a more appealing, "polite" format. Polite Flamingo is trained to reconstruct high-quality responses from their automatically distorted counterparts and is subsequently applied to a vast array of vision-language datasets for response rewriting. After rigorous filtering, we generate the PF-1M dataset and further validate its value by fine-tuning a multi-modal LLM with it. Combined with novel methodologies including U-shaped multi-stage tuning and multi-turn augmentation, the resulting model, Clever Flamingo, demonstrates its advantages in both multi-modal understanding and response politeness according to automated and human evaluations. Code and dataset are available at https://github.com/ChenDelong1999/polite-flamingo

#14 Benchmarking Large Language Models in Retrieval-Augmented Generation

Authors: Jiawei Chen ; Hongyu Lin ; Xianpei Han ; Le Sun

Retrieval-Augmented Generation (RAG) is a promising approach for mitigating the hallucination of large language models (LLMs). However, existing research lacks rigorous evaluation of the impact of retrieval-augmented generation on different large language models, which makes it challenging to identify the potential bottlenecks in the capabilities of RAG for different LLMs. In this paper, we systematically investigate the impact of Retrieval-Augmented Generation on large language models. We analyze the performance of different large language models in 4 fundamental abilities required for RAG, including noise robustness, negative rejection, information integration, and counterfactual robustness. To this end, we establish the Retrieval-Augmented Generation Benchmark (RGB), a new corpus for RAG evaluation in both English and Chinese. RGB divides the instances within the benchmark into 4 separate testbeds based on the aforementioned fundamental abilities required to resolve the case. Then we evaluate 6 representative LLMs on RGB to diagnose the challenges of current LLMs when applying RAG. Evaluation reveals that while LLMs exhibit a certain degree of noise robustness, they still struggle significantly in terms of negative rejection, information integration, and dealing with false information. The aforementioned assessment outcomes indicate that there is still a considerable journey ahead to effectively apply RAG to LLMs.

#15 CIDR: A Cooperative Integrated Dynamic Refining Method for Minimal Feature Removal Problem

Authors: Qian Chen ; Taolin Zhang ; Dongyang Li ; Xiaofeng He

The minimal feature removal problem in the post-hoc explanation area aims to identify the minimal feature set (MFS). Prior studies that use a greedy algorithm to calculate the minimal feature set rely on a monotonicity assumption that cannot be satisfied in general scenarios, and thus fail to explore feature interactions. To address these limitations, we propose a Cooperative Integrated Dynamic Refining method (CIDR) to efficiently discover minimal feature sets. Specifically, we design Cooperative Integrated Gradients (CIG) to detect interactions between features. By incorporating CIG and the characteristics of the minimal feature set, we transform the minimal feature removal problem into a knapsack problem. Additionally, we devise an auxiliary Minimal Feature Refinement algorithm to determine the minimal feature set from numerous candidate sets. To the best of our knowledge, our work is the first to address the minimal feature removal problem in the field of natural language processing. Extensive experiments demonstrate that CIDR is capable of tracing representative minimal feature sets with improved interpretability across various models and datasets.
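
To ground the knapsack reduction: given per-feature importance scores (e.g., from CIG) and removal costs, a 0/1 knapsack selects the highest-value feature subset under a budget. The dynamic program below is the generic algorithm with made-up inputs, not the paper's exact instantiation.

```python
# Generic 0/1 knapsack over per-feature attribution scores; scores, costs,
# and the budget are illustrative inputs.
def knapsack(scores, costs, budget):
    n = len(scores)
    best = [[0.0] * (budget + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for b in range(budget + 1):
            best[i][b] = best[i - 1][b]                 # skip feature i-1
            if costs[i - 1] <= b:                       # or keep it
                cand = best[i - 1][b - costs[i - 1]] + scores[i - 1]
                best[i][b] = max(best[i][b], cand)
    # Backtrack to recover the selected feature indices.
    chosen, b = [], budget
    for i in range(n, 0, -1):
        if best[i][b] != best[i - 1][b]:
            chosen.append(i - 1)
            b -= costs[i - 1]
    return chosen[::-1]

print(knapsack(scores=[0.9, 0.2, 0.7], costs=[2, 1, 2], budget=3))  # -> [0, 1]
```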

#16 Is a Large Language Model a Good Annotator for Event Extraction?

Authors: Ruirui Chen ; Chengwei Qin ; Weifeng Jiang ; Dongkyu Choi

Event extraction is an important task in natural language processing that focuses on mining event-related information from unstructured text. Despite considerable advancements, it is still challenging to achieve satisfactory performance in this task, and issues like data scarcity and imbalance obstruct progress. In this paper, we introduce an innovative approach where we employ Large Language Models (LLMs) as expert annotators for event extraction. We strategically include sample data from the training dataset in the prompt as a reference, ensuring alignment between the data distribution of LLM-generated samples and that of the benchmark dataset. This enables us to craft an augmented dataset that complements existing benchmarks, alleviating the challenges of data imbalance and scarcity and thereby enhancing the performance of fine-tuned models. We conducted extensive experiments to validate the efficacy of our proposed method, and we believe that this approach holds great potential for propelling the development and application of more advanced and reliable event extraction systems in real-world scenarios.

#17 Modeling Adaptive Inter-Task Feature Interactions via Sentiment-Aware Contrastive Learning for Joint Aspect-Sentiment Prediction

Authors: Wei Chen ; Yuxuan Liu ; Zhao Zhang ; Fuzhen Zhuang ; Jiang Zhong

Aspect prediction (AP) and sentiment prediction (SP) are representative applications in fine-grained sentiment analysis. They can be considered sequential tasks, where AP identifies the aspects mentioned in a sentence and SP infers fine-grained sentiments for these aspects. Recent models perform aspect-sentiment prediction jointly but rely heavily on feature interactions between aspect and sentiment. One drawback is that they ignore that the correlation strength between aspect features and sentiment features varies across sentences, and employing a fixed feature-interaction strategy may limit effective knowledge transfer across tasks. To tackle this issue, in this paper we propose an Adaptive Inter-task Feature Interaction framework, AIFI, for joint aspect-sentiment prediction. Specifically, we introduce a novel contrast-based alignment method based on contrastive learning. Our approach treats the AP-specific and SP-specific representations of a given sentence as a positive pair, while the representation of another random sentence serves as a negative example. Moreover, we propose an inter-task feature correlation network to predict the contrast strength, which is determined by the temperature coefficient in the InfoNCE loss. This dynamic correlation adjustment enhances the model's ability to capture proper feature interactions more efficiently. Experimental results on three datasets validate the effectiveness of our approach.
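
A sketch of the key mechanism: an InfoNCE loss over AP/SP representation pairs whose temperature is predicted per sentence by a small network, so the contrast strength adapts to inter-task correlation. Module names and the temperature range are assumptions, not AIFI's actual implementation.

```python
# InfoNCE with a per-example temperature predicted from the two task-specific
# representations of the same sentence.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveInfoNCE(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.temp_net = nn.Linear(2 * dim, 1)   # predicts contrast strength

    def forward(self, ap_repr, sp_repr):
        # ap_repr, sp_repr: (batch, dim); row i of each is a positive pair.
        tau = 0.05 + torch.sigmoid(self.temp_net(torch.cat([ap_repr, sp_repr], -1)))
        sim = F.cosine_similarity(ap_repr.unsqueeze(1), sp_repr.unsqueeze(0), dim=-1)
        logits = sim / tau                      # (batch, batch), per-row temperature
        labels = torch.arange(ap_repr.size(0))  # positives sit on the diagonal
        return F.cross_entropy(logits, labels)

loss = AdaptiveInfoNCE(16)(torch.randn(8, 16), torch.randn(8, 16))
```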

#18 From Coarse to Fine: A Distillation Method for Fine-Grained Emotion-Causal Span Pair Extraction in Conversation

Authors: Xinhao Chen ; Chong Yang ; Changzhi Sun ; Man Lan ; Aimin Zhou

We study the problem of extracting emotions and the causes behind them in conversations. Existing methods either tackle the two subtasks separately or jointly model them at a coarse-grained level of emotions (fewer emotion categories) and causes (utterance-level causes). In this work, we aim to jointly extract more fine-grained emotions and causes. We construct a fine-grained dataset, FG-RECCON, which includes 16 fine-grained emotion categories and span-level causes. To further improve fine-grained extraction performance, we propose to utilize causal discourse knowledge via knowledge distillation. Specifically, a teacher model learns to predict causal connective words between utterances and then guides a student model in identifying both the fine-grained emotion labels and the causal spans. Experimental results demonstrate that our distillation method achieves state-of-the-art performance on both the RECCON and FG-RECCON datasets.

#19 Divergence-Guided Simultaneous Speech Translation

Authors: Xinjie Chen ; Kai Fan ; Wei Luo ; Linlin Zhang ; Libo Zhao ; Xinggao Liu ; Zhongqiang Huang

To achieve high-quality translation with low latency, a Simultaneous Speech Translation (SimulST) system relies on a policy module to decide whether to translate immediately or wait for additional streaming input, along with a translation model capable of effectively handling partial speech input. Prior research has tackled these components separately, either using "wait-k" policies based on fixed-length segments or detected word boundaries, or dynamic policies based on different strategies (e.g., meaningful units), while employing offline models for prefix-to-prefix translation. In this paper, we propose Divergence-Guided Simultaneous Speech Translation (DiG-SST), a tightly integrated approach focusing on both translation quality and latency for streaming input. Specifically, we introduce a simple yet effective prefix-based strategy for training translation models with partial speech input, and develop an adaptive policy that makes read/write decisions for the translation model based on the expected divergence in translation distributions resulting from future input. Our experiments on multiple translation directions of the MuST-C benchmark demonstrate that our approach achieves a better trade-off between translation quality and latency compared to existing methods.
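
The adaptive policy can be caricatured in a few lines: compare the translation distribution given the current speech prefix with the one expected after more input, and emit a token only when the divergence is small. The interface and threshold below are illustrative assumptions, not DiG-SST's actual policy network.

```python
# Toy read/write decision from the divergence between next-token distributions
# with and without (expected) lookahead input.
import torch
import torch.nn.functional as F

def should_write(p_now: torch.Tensor, p_future: torch.Tensor, thresh=0.1) -> bool:
    """p_now/p_future: next-token distributions (vocab,)."""
    kl = F.kl_div(p_future.log(), p_now, reduction="sum")  # KL(p_now || p_future)
    return kl.item() < thresh  # stable prediction -> safe to emit a token

p_now = torch.tensor([0.7, 0.2, 0.1])
p_future = torch.tensor([0.68, 0.22, 0.10])
print("WRITE" if should_write(p_now, p_future) else "READ")
```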

#20 Benchmarking Large Language Models on Controllable Generation under Diversified Instructions

Authors: Yihan Chen ; Benfeng Xu ; Quan Wang ; Yi Liu ; Zhendong Mao

While large language models (LLMs) have exhibited impressive instruction-following capabilities, it is still unclear whether, and to what extent, they can respond to explicit constraints that might be entailed in various instructions. As a significant aspect of LLM alignment, it is thus important to formulate such a specialized set of instructions and to investigate the resulting behavior of LLMs. To fill this gap, we propose a new benchmark, CoDI-Eval, to systematically and comprehensively evaluate LLMs' responses to instructions with various constraints. We construct a large collection of constraint-attributed instructions as a test suite focused on both generalization and coverage. Specifically, we advocate an instruction diversification process to synthesize diverse forms of constraint expression and also deliberate the candidate task taxonomy with even finer-grained sub-categories. Finally, we automate the entire evaluation process to facilitate further development. Unlike existing studies on controllable text generation, CoDI-Eval extends the scope to the prevalent instruction-following paradigm for the first time. We provide extensive evaluations of representative LLMs (e.g., ChatGPT, Vicuna) on CoDI-Eval, revealing their limitations in following instructions with specific constraints as well as a significant gap between open-source and commercial closed-source LLMs. We believe this benchmark will facilitate research into improving the controllability of LLMs' responses to instructions. Our data and code are available at https://github.com/Xt-cyh/CoDI-Eval.

#21 Journey to the Center of the Knowledge Neurons: Discoveries of Language-Independent Knowledge Neurons and Degenerate Knowledge Neurons

Authors: Yuheng Chen ; Pengfei Cao ; Yubo Chen ; Kang Liu ; Jun Zhao

Pre-trained language models (PLMs) contain vast amounts of factual knowledge, but how that knowledge is stored in their parameters remains unclear. This paper delves into the complex task of understanding how factual knowledge is stored in multilingual PLMs, and introduces the Architecture-adapted Multilingual Integrated Gradients method, which localizes knowledge neurons more precisely than current methods and is more universal across architectures and languages. Moreover, we conduct an in-depth exploration of knowledge neurons, leading to two important discoveries: (1) Language-Independent Knowledge Neurons, which store factual knowledge in a form that transcends language; we design cross-lingual knowledge editing experiments demonstrating that PLMs can accomplish this task based on language-independent neurons. (2) Degenerate Knowledge Neurons, a novel type of neuron showing that different knowledge neurons can store the same fact; this functional overlap endows PLMs with a robust mastery of factual knowledge, and we design fact-checking experiments proving that degenerate knowledge neurons can help PLMs detect wrong facts. These findings shed light on the mechanisms of factual knowledge storage in multilingual PLMs and contribute valuable insights to the field. The code is available at https://github.com/heng840/AMIG.
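
For reference, plain integrated gradients, the attribution primitive the paper adapts to multilingual architectures, accumulates gradients along a straight-line path from a baseline embedding to the input embedding. The model stub and step count below are assumptions for illustration.

```python
# Plain integrated gradients over an embedded input.
import torch

def integrated_gradients(model, emb, baseline, steps=50):
    # emb, baseline: (seq, dim); interpolate along the straight-line path.
    total = torch.zeros_like(emb)
    for k in range(1, steps + 1):
        point = (baseline + k / steps * (emb - baseline)).requires_grad_(True)
        model(point).sum().backward()
        total += point.grad
    return (emb - baseline) * total / steps   # attribution per embedding cell

model = torch.nn.Sequential(torch.nn.Linear(8, 1))   # stand-in for a PLM layer
attr = integrated_gradients(model, torch.randn(5, 8), torch.zeros(5, 8))
```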

#22 Talk Funny! A Large-Scale Humor Response Dataset with Chain-of-Humor Interpretation

Authors: Yuyan Chen ; Yichen Yuan ; Panjun Liu ; Dayiheng Liu ; Qinghao Guan ; Mengfei Guo ; Haiming Peng ; Bang Liu ; Zhixu Li ; Yanghua Xiao

Humor is a crucial part of human communication. Understanding humor and generating humorous responses in dialogue can enable natural and empathic human-computer interactions. However, most existing pre-trained language models (PLMs) perform unsatisfactorily in humor generation. On one hand, the serious shortage of humor corpora and datasets poses challenges for constructing models that can understand and generate humorous expressions. On the other hand, humor generation relies on rich knowledge and commonsense, which is often tacit and unspoken. In this paper, we construct the largest Chinese Explainable Humor Response Dataset to date, with chain-of-humor and humor mind map annotations, which can be used to comprehensively evaluate as well as improve the humorous response ability of PLMs. We also design humor-related auxiliary tasks to further enhance PLMs' humorous response performance. Extensive evaluations demonstrate that our proposed dataset and auxiliary tasks effectively help PLMs to generate humorous responses, laying the groundwork for future humor research.

#23 Editing Language Model-Based Knowledge Graph Embeddings

Authors: Siyuan Cheng ; Ningyu Zhang ; Bozhong Tian ; Xi Chen ; Qingbin Liu ; Huajun Chen

Recent decades have witnessed the empirical success of framing Knowledge Graph (KG) embeddings via language models. However, language model-based KG embeddings are usually deployed as static artifacts, making them difficult to modify without re-training after deployment. To address this issue, we propose a new task of editing language model-based KG embeddings. This task is designed to facilitate rapid, data-efficient updates to KG embeddings without compromising performance in other aspects. We build four new datasets: E-FB15k237, A-FB15k237, E-WN18RR, and A-WN18RR, and evaluate several knowledge editing baselines, demonstrating the limited ability of previous models to handle this challenging task. We further propose a simple yet strong baseline dubbed KGEditor, which utilizes additional parametric layers of a hypernetwork to edit/add facts. Our comprehensive experimental results reveal that KGEditor excels at updating specific facts without impacting overall performance, even when faced with limited training resources. Code and datasets will be available at https://github.com/AnonymousForPapers/DeltaKG.

#24 Towards Multi-Intent Spoken Language Understanding via Hierarchical Attention and Optimal Transport

Authors: Xuxin Cheng ; Zhihong Zhu ; Hongxiang Li ; Yaowei Li ; Xianwei Zhuang ; Yuexian Zou

Multi-intent spoken language understanding (SLU) can handle complicated utterances expressing multiple intents, and has attracted increasing attention from researchers. Although existing models have achieved promising performance, most of them still suffer from two leading problems: (1) each intent has its specific scope, and semantic information outside that scope can hinder accurate predictions, i.e., the scope barrier; (2) only the guidance from intent to slot is modeled, while the guidance from slot to intent is often neglected, i.e., unidirectional guidance. In this paper, we propose a novel multi-intent SLU framework termed HAOT, which utilizes hierarchical attention to divide the scopes of each intent and applies optimal transport to achieve mutual guidance between slot and intent. Experiments demonstrate that our model achieves state-of-the-art performance on two public multi-intent SLU datasets, obtaining a 3.4-point improvement in overall accuracy on the MixATIS dataset over the previous best models.
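
As background for the optimal-transport component, the generic Sinkhorn iteration below computes a soft transport plan between intent and slot representations from a cost matrix; it is the standard OT primitive, not HAOT's exact module.

```python
# Sinkhorn iterations: scale rows and columns of exp(-cost/eps) until the
# plan's marginals match the (here uniform) intent and slot masses.
import torch

def sinkhorn(cost, eps=0.1, iters=50):
    # cost: (n_intents, n_slots); uniform marginals assumed.
    K = torch.exp(-cost / eps)
    u = torch.ones(cost.size(0)) / cost.size(0)
    v = torch.ones(cost.size(1)) / cost.size(1)
    a, b = u.clone(), v.clone()
    for _ in range(iters):
        a = u / (K @ b)
        b = v / (K.t() @ a)
    return a.unsqueeze(1) * K * b.unsqueeze(0)  # transport plan, rows ~ intents

plan = sinkhorn(torch.rand(3, 7))
print(plan.sum(dim=1))  # each intent's transported mass is ~1/3
```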

#25 Cooper: Coordinating Specialized Agents towards a Complex Dialogue Goal

Authors: Yi Cheng ; Wenge Liu ; Jian Wang ; Chak Tou Leong ; Yi Ouyang ; Wenjie Li ; Xian Wu ; Yefeng Zheng

In recent years, there has been a growing interest in exploring dialogues with more complex goals, such as negotiation, persuasion, and emotional support, which go beyond traditional service-focused dialogue systems. Apart from the requirement for much more sophisticated strategic reasoning and communication skills, a significant challenge of these tasks lies in the difficulty of objectively measuring the achievement of their goals in a quantifiable way, making it difficult for existing research to directly optimize the dialogue procedure towards them. In our work, we emphasize the multifaceted nature of complex dialogue goals and argue that it is more feasible to accomplish them by comprehensively considering and jointly promoting their different aspects. To this end, we propose a novel dialogue framework, Cooper, which coordinates multiple specialized agents, each dedicated to a specific dialogue goal aspect separately, to approach the complex objective. Through this divide-and-conquer manner, we make complex dialogue goals more approachable and elicit greater intelligence via the collaboration of individual agents. Experiments on persuasion and emotional support dialogues demonstrate the superiority of our method over a set of competitive baselines. Our codes are available at https://github.com/YiCheng98/Cooper.